Basic Protocol 1: Exploratory Data Analysis with pcaExplorer
Compiled date: 2021-09-07
Last edited: 2021-09-07
library("DESeq2")
library("topGO")
library("org.Mm.eg.db")
library("pcaExplorer")
library("ideal")
library("GeneTonic")
pcaExplorer is a Bioconductor package [https://doi.org/10.1186/gb-2004-5-10-r80] that serves as a general-purpose interactive companion tool for RNA-seq analyses. It is designed to guide the user in exploring the Principal Components (PC) [https://doi.org/10.2307/1270093] of the data under inspection. Besides Principal Component Analysis (PCA) [https://doi.org/10.2307/1270093], pcaExplorer provides tools to detect outlier samples and genes that show particular patterns, and it offers a functional interpretation of the principal components for further quality assessment and hypothesis generation on the input data.
In this protocol, we describe how to launch the pcaExplorer Shiny application [https://CRAN.R-project.org/package=shiny] with the macrophage dataset [https://doi.org/10.1038/s41588-018-0046-7], which is also distributed via Bioconductor [https://doi.org/10.1186/gb-2004-5-10-r80].
Hardware
A modern desktop computer or laptop with any up-to-date operating system
Software
R 3.3 or higher; Bioconductor 3.3 or higher; RStudio (optional); a web browser to open vignettes (optional)
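The version requirements can be checked from the R console before continuing. The following is a minimal sketch; it assumes that Bioconductor is managed through the BiocManager package, which this protocol does not state explicitly.

```r
# Check the installed R version against the minimum stated above
r_ok <- getRversion() >= "3.3.0"

# Assumption: Bioconductor is managed through the BiocManager package.
# If BiocManager is missing or the version cannot be determined,
# the Bioconductor check is reported as NA instead of failing.
bioc_ok <- NA
if (requireNamespace("BiocManager", quietly = TRUE)) {
  bioc_ok <- tryCatch(
    BiocManager::version() >= "3.3",
    error = function(e) NA
  )
}

message("R version sufficient: ", r_ok)
message("Bioconductor version sufficient: ", bioc_ok)
```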
Files
pcaExplorer takes up to three input files in text format; the first two are required and the third is optional. The files are expected to be tab-separated, but comma- or semicolon-separated files are also accepted (see Alternative protocol 1). The first input file is the count matrix, which stores the number of times (i.e., the counts) a certain feature (e.g., a gene) is found in each sample. In the count matrix, the columns correspond to the samples and the rows to the individual features (see Fig. 1).
Figure 1 Example of a count matrix - The figure shows an example of a count matrix, with the individual features (here genes) in the rows and the individual samples in the columns. The matrix represents how many times each feature was found in each sample. The count matrix shown is comma-separated, but semicolon- and tab-separated inputs are also accepted (see Alternative protocol 1).
The second input of pcaExplorer is the metadata file, which stores the necessary experimental variables for each sample. The rows of the file correspond to the individual samples, while the columns store the different experimental variables (see Fig. 2).
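To make the expected layout concrete, a comma-separated count matrix can be read into R as sketched here. The counts and sample names are made-up toy values, not taken from the macrophage dataset, and the text is read from an in-memory string so that the example runs without any file on disk.

```r
# Toy comma-separated count matrix (hypothetical values for illustration only):
# first column = feature ids, remaining columns = one column per sample
counts_csv <- c(
  "gene_id,sample_1,sample_2,sample_3",
  "ENSG00000000003,12,0,7",
  "ENSG00000000005,0,3,1",
  "ENSG00000000419,250,198,305"
)

# row.names = 1 turns the feature ids into row names;
# check.names = FALSE keeps the sample names exactly as written
countmatrix <- as.matrix(
  read.csv(
    text = paste(counts_csv, collapse = "\n"),
    row.names = 1,
    check.names = FALSE
  )
)

# Features are in the rows, samples in the columns
print(dim(countmatrix))
print(countmatrix)
```

For a file on disk, the `text` argument would be replaced by the file path, and `read.delim()` or `read.csv2()` would be the base R counterparts for tab- and semicolon-separated files, respectively.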
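A small sketch of a matching metadata table is given here. The variable name `condition` and its levels are hypothetical placeholders; the point of the example is that the sample names in the metadata rows should agree with the column names of the count matrix.

```r
# Hypothetical sample names, standing in for the column names of a count matrix
sample_names <- c("sample_1", "sample_2", "sample_3")

# One row per sample, one column per experimental variable
# ('condition' and its values are made up for illustration)
metadata <- data.frame(
  condition = c("control", "treated", "control"),
  row.names = sample_names,
  stringsAsFactors = FALSE
)

# Sanity check: every sample of the count matrix is described in the metadata,
# and in the same order
samples_match <- identical(rownames(metadata), sample_names)
print(metadata)
print(samples_match)
```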
Lastly, the third input of pcaExplorer, the annotation file, is optional but highly recommended because it makes the results easier to interpret. The file contains the feature ids of the count matrix in the rows and at least one column called gene_name, which contains a more human-readable form of the feature ids (e.g., HGNC gene names [https://doi.org/10.1093/nar/gkaa980] if the features are gene ids). Fig. 3 shows an example of an annotation file.
Figure 3 Example of an annotation file - The annotation file contains a row for each of the features of the count matrix. Furthermore, the file contains at least one column called gene_name, which contains a more human-readable form of the feature identifiers of the rows. In the example shown, the feature ids are ENSEMBL ids, while the column gene_name contains HGNC gene names.
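The annotation layout can be sketched in the same way. The three ENSEMBL ids and gene symbols below are only a small illustrative selection, and the column name `gene_name` (with an underscore) reflects our understanding of what pcaExplorer expects; it is worth confirming against the package documentation.

```r
# Feature ids as they would appear in the rows of the count matrix
feature_ids <- c("ENSG00000000003", "ENSG00000000005", "ENSG00000000419")

# Annotation: feature ids as row names, human-readable names in 'gene_name'
# (column name assumed from the pcaExplorer documentation)
annotation <- data.frame(
  gene_name = c("TSPAN6", "TNMD", "DPM1"),
  row.names = feature_ids,
  stringsAsFactors = FALSE
)

# Features of the count matrix that have no entry in the annotation
missing_ids <- setdiff(feature_ids, rownames(annotation))
print(annotation)
print(length(missing_ids))
```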
Exploring the data with pcaExplorer
Before we start with the exploration of the data, the necessary packages and dependencies need to be installed and loaded. Support Protocol 1 describes how to install and load the packages.
Launch the application with the command pcaExplorer(countmatrix, metadata, annotation), where countmatrix, metadata and annotation have to be substituted by the file paths of the respective input files. Enter the command into the console of RStudio and press the Enter key. This should open a second window with the pcaExplorer application, showing the Data Upload panel (Fig. 4).
Figure 4 Data Upload panel - The figure shows the Data Upload panel, which is the first panel a user sees upon launching pcaExplorer. The panel provides previews of the input data as well as buttons to interact with the application.
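One possible way to spell out the launch step is sketched here. Several points are assumptions rather than part of this protocol: the three file names are hypothetical placeholders, the files are assumed to be tab-separated, and the inputs are read into R objects and passed through the named arguments `countmatrix`, `coldata` and `annotation`, which is how we understand the pcaExplorer() interface (see `?pcaExplorer` to confirm for the installed version).

```r
# Hypothetical file names; replace them with the paths of your own input files
count_file      <- "countmatrix.tsv"
metadata_file   <- "metadata.tsv"
annotation_file <- "annotation.tsv"

# Only attempt the launch if all files exist, the session is interactive
# (Shiny apps need one) and pcaExplorer is installed
inputs_present <- all(file.exists(c(count_file, metadata_file, annotation_file)))

if (inputs_present && interactive() &&
    requireNamespace("pcaExplorer", quietly = TRUE)) {
  # Assumption: tab-separated files with the ids in the first column
  countmatrix <- as.matrix(
    read.delim(count_file, row.names = 1, check.names = FALSE)
  )
  metadata   <- read.delim(metadata_file, row.names = 1)
  annotation <- read.delim(annotation_file, row.names = 1)

  # Named arguments avoid relying on the positional order of the parameters
  pcaExplorer::pcaExplorer(
    countmatrix = countmatrix,
    coldata     = metadata,
    annotation  = annotation
  )
} else {
  message("Inputs or pcaExplorer not available; app not launched.")
}
```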
Click the “Generate the dds and dst object” button to generate the dds and dst objects needed for the exploration of the data (Fig. 4), then scroll down on the Data Upload panel. Once the objects have been generated, the ‘Select one of the following transformations for your data:’ option appears with three blue buttons underneath, each of which applies a different transformation to the data. The button on the far left, ‘Compute variance stabilized transformed data from the dds object’, computes a variance-stabilized version of the data. The middle button computes a regularized-logarithm-transformed version of the data, and the button on the far right computes log2-transformed data. Which transformation to choose depends on the input data. If you follow this protocol using the provided sample data, click the ‘Compute variance stabilized transformed data from the dds object’ button on the far left.
At the bottom of the Data Upload panel, you will find four green buttons; clicking one of them shows a preview of the respective input data.
Navigate to the Counts Table panel by clicking on its name in the panel list at the top of each panel (Fig. 5). This panel shows the counts of the count matrix in a table. A drop-down menu at the top of the panel lets you change the data scale of the table (e.g., raw counts, normalized counts, regularized logarithm transformed counts, etc.). You can download the counts table by clicking the green download button below the table (Fig. 5).
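For readers who want to see what the three transformations look like outside the app, the sketch below applies the corresponding DESeq2 functions to a small simulated dataset. That the buttons map onto exactly these functions is our assumption; the simulated data from makeExampleDESeqDataSet() merely stand in for real counts, and the code is skipped if DESeq2 is not installed.

```r
if (requireNamespace("DESeq2", quietly = TRUE)) {
  set.seed(1)

  # Small simulated dataset (500 genes, 6 samples), used only for illustration
  dds <- DESeq2::makeExampleDESeqDataSet(n = 500, m = 6)
  dds <- DESeq2::estimateSizeFactors(dds)

  # Variance stabilizing transformation (left button, assumed equivalent)
  vsd <- DESeq2::varianceStabilizingTransformation(dds, blind = TRUE)

  # Regularized logarithm transformation (middle button, assumed equivalent)
  rld <- DESeq2::rlog(dds, blind = TRUE)

  # log2 of normalized counts plus a pseudocount of 1 (right button, assumed)
  l2d <- DESeq2::normTransform(dds)

  # Each result has the same dimensions as the input count matrix
  transformed <- list(
    vst  = SummarizedExperiment::assay(vsd),
    rlog = SummarizedExperiment::assay(rld),
    log2 = SummarizedExperiment::assay(l2d)
  )
  print(sapply(transformed, dim))
} else {
  message("DESeq2 is not installed; skipping the transformation example.")
}
```

As a rough guide, the variance stabilizing transformation is usually much faster than the regularized logarithm on datasets with many samples, which is one practical reason to prefer it for larger experiments.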
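To illustrate the difference between the raw and the normalized scale, the following base R sketch computes median-of-ratios size factors, the normalization approach used by DESeq2, on a tiny made-up matrix. We assume that the normalized counts option in the panel follows this convention; the numbers are chosen so that the samples differ only in sequencing depth.

```r
# Toy raw counts: sample_2 and sample_3 are sequenced 2x and 3x as deeply
counts <- matrix(
  c( 10,  20,  30,
    100, 200, 300,
      5,  10,  15),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("gene_a", "gene_b", "gene_c"),
                  c("sample_1", "sample_2", "sample_3"))
)

# Log of the geometric mean of each gene across samples
log_geo_means <- rowMeans(log(counts))

# Size factor of a sample = median ratio of its counts to the geometric means,
# ignoring genes with zero counts
size_factors <- apply(counts, 2, function(sample_counts) {
  usable <- is.finite(log_geo_means) & sample_counts > 0
  exp(median(log(sample_counts[usable]) - log_geo_means[usable]))
})

# Normalized counts: divide each column by its size factor
normalized <- sweep(counts, 2, size_factors, "/")

# log2 scale with a pseudocount of 1 to avoid log2(0)
log2_normalized <- log2(normalized + 1)

print(size_factors)
print(normalized)
```

After normalization, each gene has the same value in all three samples, because depth was the only difference between them in this toy example.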
Figure 5 Counts Table panel - The figure shows the Counts Table panel. This panel provides an overview of the input count matrix.
Scroll down until you see the ‘Sample to sample scatter plots’ heading (Fig. 5). Choose ‘pearson’ as the correlation method from the ‘Correlation method palette’. Ensure that both options, ‘Use log2 values for plot axes and values’ and ‘Use a subset of max 1000 genes (quicker to plot)’, are selected; an option is selected if the small box in front of it is ticked (Fig. 5). Click the ‘Run’ button to generate the scatter plots.
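The settings chosen in this step can be mimicked in base R as a rough sketch: counts are put on a log2 scale, at most 1000 genes are sampled, and Pearson correlations are computed between all pairs of samples. The simulated counts and the pseudocount of 1 are our own assumptions and do not necessarily match what the application does internally.

```r
set.seed(42)

# Simulated counts for 2000 genes and 4 samples (illustration only)
n_genes <- 2000
base_expr <- rnbinom(n_genes, mu = 100, size = 1)
counts <- sapply(1:4, function(i) rnbinom(n_genes, mu = base_expr + 1, size = 10))
colnames(counts) <- paste0("sample_", 1:4)

# 'Use a subset of max 1000 genes': sample at most 1000 rows
keep <- sample(nrow(counts), min(1000, nrow(counts)))

# 'Use log2 values': log2 scale with a pseudocount of 1 (assumed)
log_counts <- log2(counts[keep, ] + 1)

# Pearson correlation between every pair of samples
pearson <- cor(log_counts, method = "pearson")
print(round(pearson, 3))

# Scatter plot matrix of all sample pairs (drawn only in interactive sessions)
if (interactive()) {
  pairs(log_counts, pch = ".", main = "Sample to sample scatter plots")
}
```

Samples of good quality from the same condition would be expected to lie close to the diagonal in such plots, with correlation coefficients near 1.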
TODO: Something about the inspection of the plot and some general remarks to the type of plots
Figure 6
Figure 7 Sample to Sample similarity heatmap
Figure 8